Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees

Authors

  • Yasuo Tabei
  • Koji Tsuda
Abstract

Similarity search in databases of labeled graphs is a fundamental task in managing graph data such as XML, chemical compounds and social networks. Typically, a graph is decomposed into a set of substructures (e.g., paths, trees and subgraphs), and a similarity measure is defined via the number of common substructures. Using this representation, graphs can be stored in a document database by regarding graphs as documents and substructures as words. A graph similarity query then translates to a semi-conjunctive query that retrieves graphs sharing at least k substructures with the query graph. We argue that this kind of query cannot be solved efficiently by conventional inverted indexes, and develop a novel recursive search algorithm on wavelet trees (Grossi et al., SODA’03). Unlike gIndex, it does not require frequent subgraph mining for indexing. In experiments, our method was successfully applied to 25 million chemical compounds.
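To make the semi-conjunctive query concrete, here is a minimal sketch of its definition using a plain inverted index with per-graph counting. This is the conventional baseline the abstract contrasts against, not the wavelet-tree algorithm of the paper. The function names, the toy graph ids and the substructure labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_inverted_index(graphs):
    """Map each substructure ("word") to the set of graph ids that contain it."""
    index = {}
    for gid, words in graphs.items():
        for w in set(words):
            index.setdefault(w, set()).add(gid)
    return index


def semi_conjunctive_query(index, query_words, k):
    """Return ids of graphs sharing at least k distinct substructures with the query."""
    counts = Counter()
    for w in set(query_words):
        # Every graph in this posting list shares the substructure w with the query.
        for gid in index.get(w, ()):
            counts[gid] += 1
    return sorted(gid for gid, c in counts.items() if c >= k)


# Toy database (hypothetical): each graph is a bag of substructure labels.
graphs = {
    "g1": ["C-C", "C-O", "C=O"],
    "g2": ["C-C", "C-N"],
    "g3": ["C-O", "C=O", "O-H"],
}
index = build_inverted_index(graphs)
result = semi_conjunctive_query(index, ["C-C", "C-O", "C=O"], k=2)
print(result)
```

With k equal to the number of query words the query is fully conjunctive, and with k = 1 it is fully disjunctive; the semi-conjunctive case lies between the two. This baseline has to scan the whole posting list of every query word, which is expensive on large databases.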


Similar resources

Graph Hybrid Summarization

One approach to processing and analyzing massive graphs is summarization. Generating a high-quality summary is the main challenge of graph summarization. To generate a better-quality summary for a given attributed graph, both structural and attribute similarities must be considered. There are two measures named density and entropy to evaluate the quality of structural and at...


Efficient Similarity Search for Tree-Structured Data

Tree-structured data are becoming ubiquitous, and manipulating them based on similarity is essential for many applications. Although similarity search on textual data has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the similarity between trees, especially for large numbers of trees. In this paper, we propose to t...


Palmprint Recognition by Applying Wavelet Subband Representation and Kernel PCA

This paper presents a novel Daubechies-based kernel Principal Component Analysis (PCA) method by integrating the Daubechies wavelet representation of palm images and the kernel PCA method for palmprint recognition. The palmprint is first transformed into the wavelet domain to decompose palm images and the lowest resolution subband coefficients are chosen for palm representation. The kernel PCA ...


Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentall...


Diffusion Hashing

With the worldwide spread of the broadband Internet, massive multimedia data including texts, images, and videos are increasing explosively and are available for interactive applications over the Internet. At the same time, increasing attention has been paid to fast retrieval from massive multimedia databases. Hash-based Approximate Nearest Neighbor (ANN) search is a technology that ac...


Publication date: 2011